Abstract

Explaining adaptive behavior is a central problem in 
artificial intelligence research. Here we formalize adap- 
tive agents as mixture distributions over sequences of 
inputs and outputs (I/O). Each distribution of the mix- 
ture constitutes a 'possible world', but the agent does 
not know which of the possible worlds it is actually fac- 
ing. The problem is to adapt the I/O stream in a way 
that is compatible with the true world. A natural mea- 
sure of adaptation can be obtained by the Kullback- 
Leibler (KL) divergence between the I/O distribution 
of the true world and the I/O distribution expected by 
the agent that is uncertain about possible worlds. In 
the case of pure input streams, the Bayesian mixture 
provides a well-known solution for this problem. We 
show, however, that in the case of I/O streams this so- 
lution breaks down, because outputs are issued by the 
agent itself and require a different probabilistic syntax 
as provided by intervention calculus. Based on this 
calculus, we obtain a Bayesian control rule that allows 
modeling adaptive behavior with mixture distributions 
over I/O streams. This rule might allow for a novel 
approach to adaptive control based on a minimum KL- 
principle. 

Keywords: Adaptive behavior, Intervention calculus, 
Bayesian control, Kullback-Leibler-divergence 

Introduction 

The ability to adapt to unknown environments is often considered a
hallmark of intelligence [Beer, 1995; Hutter, 2004]. Agent and
environment can be conceptualized as two systems that exchange symbols
in every time step [Hutter, 2004]: the symbol issued by the agent is an
action, whereas the symbol issued by the environment is an observation.
Thus, both agent and environment can be conceptualized as probability
distributions over sequences of actions and observations (I/O streams).

If the environment is perfectly known, then the I/O probability
distribution of the agent can be tailored to suit this particular
environment. However, if the environment is unknown, but known to
belong to a set of possible environments, then the agent faces an
adaptation problem. Consider, for example, a robot that has been
endowed with a set of behavioral primitives and now faces the problem
of how to act while being ignorant as to which is the correct
primitive. Since we want to model both agent and environment as
probability distributions over I/O sequences, a natural way to measure
the degree of adaptation would be to measure the 'distance' in
probability space between the I/O distribution represented by the agent
and the I/O distribution conditioned on the true environment. A
suitable measure (in terms of its information-theoretic interpretation)
is readily provided by the KL-divergence [MacKay, 2003]. In the case of
passive prediction, the adaptation problem has a well-known solution.
The distribution that minimizes the KL-divergence is a Bayesian mixture
distribution over all possible environments [Haussler and Opper, 1997;
Opper, 1998]. The aim of this paper is to extend this result to
distributions over both inputs and outputs. The main result of this
paper is that this extension is only possible if we consider the
special syntax of actions in probability theory as it has been
suggested by proponents of causal calculus [Pearl, 2000].

Preliminaries 

We restrict the exposition to the case of discrete time with discrete
stochastic observations and control signals. Let $\mathcal{O}$ and
$\mathcal{A}$ be two finite sets, the first being the set of
observations and the second being the set of actions. We use
$a_{\le t} = a_1 a_2 \ldots a_t$, $ao_{\le t} = a_1 o_1 \ldots a_t o_t$
etc. to simplify the notation of strings. Using $\mathcal{A}$ and
$\mathcal{O}$, a set of interaction sequences is constructed. Define
the set of interactions as $\mathcal{Z} = \mathcal{A} \times \mathcal{O}$.
A pair $(a, o) \in \mathcal{Z}$ is called an interaction. The set of
interaction strings of length $t \ge 0$ is denoted by $\mathcal{Z}^t$.
Similarly, the set of (finite) interaction strings is
$\mathcal{Z}^* = \bigcup_{t \ge 0} \mathcal{Z}^t$ and the set of
(infinite) interaction sequences is
$\mathcal{Z}^\infty = \{ w : w = a_1 o_1 a_2 o_2 \ldots \}$, where each
$(a_t, o_t) \in \mathcal{Z}$. The interaction string of length 0 is
denoted by $\epsilon$.

Agents and environments are formalized as I/O systems. An I/O system is
a probability distribution $\Pr$ over interaction sequences
$\mathcal{Z}^\infty$. $\Pr$ is uniquely determined by the conditional
probabilities

$$ \Pr(a_t | ao_{<t}), \qquad \Pr(o_t | ao_{<t} a_t) \tag{1} $$

for each $ao_{\le t} \in \mathcal{Z}^*$. However, the semantics of the
probability distribution $\Pr$ are only fully defined once it is
coupled to another system.

Let $P$, $Q$ be two I/O systems. An interaction system $(P, Q)$ is a
coupling of the two systems giving rise to the generative distribution
$G$ that describes the probabilities that actually govern the I/O
stream once the two systems are coupled. $G$ is specified by the
equations

$$
\begin{aligned}
G(a_t | ao_{<t}) &= P(a_t | ao_{<t}) \\
G(o_t | ao_{<t} a_t) &= Q(o_t | ao_{<t} a_t)
\end{aligned}
$$

valid for all $ao_{\le t} \in \mathcal{Z}^*$. Here, $G$ models the true
probability distribution over interaction sequences that arises by
coupling two systems through their I/O streams. More specifically, for
the system $P$, $P(a_t | ao_{<t})$ is the probability of producing
action $a_t \in \mathcal{A}$ given history $ao_{<t}$ and
$P(o_t | ao_{<t} a_t)$ is the predicted probability of the observation
$o_t \in \mathcal{O}$ given history $ao_{<t} a_t$. Hence, for $P$, the
sequence $o_1 o_2 \ldots$ is its input stream and the sequence
$a_1 a_2 \ldots$ is its output stream. In contrast, the roles of
actions and observations are reversed in the case of the system $Q$.
Thus, the sequence $o_1 o_2 \ldots$ is its output stream and the
sequence $a_1 a_2 \ldots$ is its input stream. This model of
interaction is very general in that it can accommodate many specific
regimes of interaction. Note that an agent $P$ can perfectly predict
its environment $Q$ iff for all $ao_{\le t} \in \mathcal{Z}^*$,

$$ P(o_t | ao_{<t} a_t) = Q(o_t | ao_{<t} a_t). $$

In this case we say that $P$ is tailored to $Q$.
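The coupling of two I/O systems can be illustrated with a minimal Python sketch. It assumes each system is given as a function returning a dictionary of conditional probabilities; the names `sample_interaction`, `agent` and `env`, and the numeric values, are hypothetical and chosen only for illustration.

```python
import random

def sample_interaction(agent_action, env_observation, steps, rng):
    """Sample a_1 o_1 ... a_t o_t from the generative distribution G of (P, Q).

    agent_action(history)            -> dict action -> P(a_t | ao_<t)
    env_observation(history, action) -> dict observation -> Q(o_t | ao_<t a_t)
    """
    history = []
    for _ in range(steps):
        # G(a_t | ao_<t) = P(a_t | ao_<t): the agent produces the action.
        p_a = agent_action(history)
        a = rng.choices(list(p_a), weights=list(p_a.values()))[0]
        # G(o_t | ao_<t a_t) = Q(o_t | ao_<t a_t): the environment answers.
        q_o = env_observation(history, a)
        o = rng.choices(list(q_o), weights=list(q_o.values()))[0]
        history.append((a, o))
    return history

# Hypothetical memoryless systems over binary actions and observations.
def agent(history):
    return {0: 0.1, 1: 0.9}

def env(history, action):
    return {0: 0.4, 1: 0.6}

trajectory = sample_interaction(agent, env, steps=5, rng=random.Random(0))
```

Only the agent's action probabilities and the environment's observation probabilities enter the sampler; the agent's own predictions $P(o_t | ao_{<t} a_t)$ play no role in generating the stream.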

Adaptive Systems: Naive Construction 

Throughout this paper, we use the convention that $P$ is an agent to be
constructed by a designer, which is then going to be interfaced with a
preexisting but unknown environment $Q$. The designer assumes that $Q$
is going to be drawn with probability $P(m)$ from a set
$\mathcal{Q} = \{Q_m\}_{m \in \mathcal{M}}$ of possible systems before
the interaction starts, where $\mathcal{M}$ is a countable set.

Consider the case when the designer knows beforehand which environment
$Q \in \mathcal{Q}$ is going to be drawn. Then, not only can $P$ be
tailored to $Q$, but also a custom-made policy for $Q$ can be designed.
That is, the output stream $P(a_t | ao_{<t})$ is such that the true
probability $G$ of the resulting interaction system $(P, Q)$ gives rise
to interaction sequences that the designer considers desirable.

Consider now the case when the designer does not know which environment
$Q_m \in \mathcal{Q}$ is going to be drawn, and assume he has a set
$\mathcal{P} = \{P_m\}_{m \in \mathcal{M}}$ of systems such that for
each $m \in \mathcal{M}$, $P_m$ is tailored to $Q_m$ and the
interaction system $(P_m, Q_m)$ has a generative distribution $G_m$
that produces desirable interaction sequences. How can the designer
construct a system $P$ such that its behavior is as close as possible
to the custom-made system $P_m$ under any realization of
$Q_m \in \mathcal{Q}$?

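A convenient measure of how much $P$ deviates from $P_m$ is given by
the KL-divergence. A first approach would be to construct an agent $P$
so as to minimize the total expected KL-divergence to $P_m$. This is
constructed as follows. Define the history-dependent KL-divergences
over the action $a_t$ and observation $o_t$ as

$$
\begin{aligned}
D_m^{a_t}(ao_{<t}) &\equiv \sum_{a_t} P_m(a_t | ao_{<t})
  \log_2 \frac{P_m(a_t | ao_{<t})}{\Pr(a_t | ao_{<t})} \\
D_m^{o_t}(ao_{<t} a_t) &\equiv \sum_{o_t} P_m(o_t | ao_{<t} a_t)
  \log_2 \frac{P_m(o_t | ao_{<t} a_t)}{\Pr(o_t | ao_{<t} a_t)},
\end{aligned}
$$

where $\Pr$ is a given arbitrary agent. Then, define the average
KL-divergences over $a_t$ and $o_t$ as

$$
\begin{aligned}
D_m^{a_t} &\equiv \sum_{ao_{<t}} P_m(ao_{<t}) \, D_m^{a_t}(ao_{<t}) \\
D_m^{o_t} &\equiv \sum_{ao_{<t} a_t} P_m(ao_{<t} a_t) \, D_m^{o_t}(ao_{<t} a_t).
\end{aligned}
$$

Finally, we define the total expected KL-divergence of $\Pr$ to $P_m$
as

$$ D \equiv \limsup_{t \to \infty} \sum_m P(m) \sum_{\tau=1}^{t}
   \left( D_m^{a_\tau} + D_m^{o_\tau} \right). $$

We construct the agent $P$ as the system that minimizes $D = D(\Pr)$:

$$ P \equiv \arg\min_{\Pr} D(\Pr). \tag{2} $$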

The solution to Equation (2) is the system $P$ defined by the set of
equations

$$
\begin{aligned}
P(a_t | ao_{<t}) &= \sum_m P_m(a_t | ao_{<t}) \, w_m(ao_{<t}) \\
P(o_t | ao_{<t} a_t) &= \sum_m P_m(o_t | ao_{<t} a_t) \, w_m(ao_{<t} a_t)
\end{aligned}
\tag{3}
$$

valid for all $ao_{\le t} \in \mathcal{Z}^*$, where the mixture weights
are

$$
\begin{aligned}
w_m(ao_{<t}) &\equiv \frac{P(m) P_m(ao_{<t})}{\sum_{m'} P(m') P_{m'}(ao_{<t})} \\
w_m(ao_{<t} a_t) &\equiv \frac{P(m) P_m(ao_{<t} a_t)}{\sum_{m'} P(m') P_{m'}(ao_{<t} a_t)}.
\end{aligned}
\tag{4}
$$

For reference, see Haussler and Opper [1997]; Opper [1998]. It is clear
that $P$ is just the Bayesian mixture over the agents $P_m$. If we
define the conditional probabilities

$$
\begin{aligned}
P(a_t | m, ao_{<t}) &\equiv P_m(a_t | ao_{<t}) \\
P(o_t | m, ao_{<t} a_t) &\equiv P_m(o_t | ao_{<t} a_t)
\end{aligned}
\tag{5}
$$

for all $ao_{\le t} \in \mathcal{Z}^*$, then Equation (3) can be
rewritten as

$$
\begin{aligned}
P(a_t | ao_{<t}) &= \sum_m P(a_t | m, ao_{<t}) \, P(m | ao_{<t}) \\
P(o_t | ao_{<t} a_t) &= \sum_m P(o_t | m, ao_{<t} a_t) \, P(m | ao_{<t} a_t)
\end{aligned}
\tag{6}
$$

where $P(m | ao_{<t}) = w_m(ao_{<t})$ and
$P(m | ao_{<t} a_t) = w_m(ao_{<t} a_t)$ are just the posterior
probabilities over the elements in $\mathcal{M}$ given the past
interactions. Hence, the conditional probabilities in Equation (5),
together with the prior probabilities $P(m)$, define a Bayesian model
over interaction sequences with hypotheses $m \in \mathcal{M}$.

The behavior of $P$ can be described as follows. At any given time $t$,
$P$ maintains a mixture over systems $P_m$. The weighting over them is
given by the mixture coefficients $w_m$. Whenever a new action $a_t$ or
a new observation $o_t$ is produced (by the agent or the environment
respectively), the weights $w_m$ are updated according to Bayes' rule.
In addition, $P$ issues an action $a_t$ suggested by a system $P_m$
drawn randomly according to the weights $w_m$.
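The weight update described above can be sketched in a few lines of Python. The two action models and their probabilities are hypothetical values chosen for illustration, and `bayes_update` is an assumed helper name, not something defined in the text.

```python
def bayes_update(weights, likelihoods):
    """Return posterior weights proportional to weights[m] * likelihoods[m]."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical action models P_m(a_t) for two systems m = 0 and m = 1.
action_prob = [{0: 0.9, 1: 0.1},   # P_0 prefers action 0
               {0: 0.1, 1: 0.9}]   # P_1 prefers action 1

weights = [0.5, 0.5]               # uniform prior P(m)

# The naive mixture agent issues action 1 and then conditions on it,
# exactly as it would on an observation coming from the environment.
own_action = 1
weights_after_action = bayes_update(
    weights, [p[own_action] for p in action_prob])
```

In this sketch a single self-generated action moves the weights from (0.5, 0.5) to (0.1, 0.9), although the environment has not yet provided any evidence.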

However, there is an important problem with $P$ that arises due to the
fact that it is not only a system that is passively observing symbols,
but also actively generating them. Therefore, an action that is
generated by the agent should not provide the same information as an
observation that is issued by its environment. Intuitively, it does not
make any sense to use one's own actions to do inference. In the
following section we illustrate this problem with a simple statistical
example.

The Problem of Causal Intervention 

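Suppose a statistician is asked to design a model for a given data set
$\mathcal{D}$ and she decides to use a Bayesian method. She computes
the posterior probability density function (pdf) over the parameters
$\theta$ of the model given the data using Bayes' rule:

$$ p(\theta | \mathcal{D}) =
   \frac{p(\mathcal{D} | \theta) \, p(\theta)}
        {\int p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}, $$

where $p(\mathcal{D} | \theta)$ is the likelihood of $\mathcal{D}$
given $\theta$ and $p(\theta)$ is the prior pdf of $\theta$. She can
simulate the source by drawing a sample data set $S$ from the
predictive pdf

$$ p(S | \mathcal{D}) =
   \int p(S | \mathcal{D}, \theta) \, p(\theta | \mathcal{D}) \, d\theta, $$

where $p(S | \mathcal{D}, \theta)$ is the likelihood of $S$ given
$\mathcal{D}$ and $\theta$. She decides to do so, obtaining a sample
set $S'$. She understands that the nature of $S'$ is very different
from $\mathcal{D}$: while $\mathcal{D}$ is informative and does change
the belief state of the Bayesian model, $S'$ is non-informative and
thus is a reflection of the model's belief state. Hence, she would
never use $S'$ to further condition the Bayesian model. Mathematically,
she seems to imply that

$$ p(\theta | \mathcal{D}, S') = p(\theta | \mathcal{D}) $$

if $S'$ has been generated from $p(S | \mathcal{D})$ itself. But this
simple independence assumption is not correct as the following
elaboration of the example will show.

The statistician is now told that the source is waiting for the
simulation results $S'$ in order to produce a next data set
$\mathcal{D}'$ which does depend on $S'$. She hands in $S'$ and obtains
a new data set $\mathcal{D}'$. Using Bayes' rule, the posterior pdf
over the parameters is now

$$ \frac{p(\mathcal{D}' | \mathcal{D}, S', \theta) \, p(\mathcal{D} | \theta) \, p(\theta)}
        {\int p(\mathcal{D}' | \mathcal{D}, S', \theta') \, p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}
   \tag{7} $$

where $p(\mathcal{D}' | \mathcal{D}, S', \theta)$ is the likelihood of
the new data $\mathcal{D}'$ given the old data $\mathcal{D}$, the
parameters $\theta$ and the simulated data $S'$. Notice that this looks
almost like the posterior pdf $p(\theta | \mathcal{D}, S', \mathcal{D}')$
given by

$$ \frac{p(\mathcal{D}' | \mathcal{D}, S', \theta) \, p(S' | \mathcal{D}, \theta) \, p(\mathcal{D} | \theta) \, p(\theta)}
        {\int p(\mathcal{D}' | \mathcal{D}, S', \theta') \, p(S' | \mathcal{D}, \theta') \, p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'} $$

with the exception that now the Bayesian update contains the
likelihoods of the simulated data $p(S' | \mathcal{D}, \theta)$. This
suggests that Equation (7) is a variant of the posterior pdf
$p(\theta | \mathcal{D}, S', \mathcal{D}')$ but where the simulated
data $S'$ is treated in a different way than the data $\mathcal{D}$ and
$\mathcal{D}'$.

Define the pdf $p'$ such that the pdfs $p'(\theta)$,
$p'(\mathcal{D} | \theta)$, $p'(\mathcal{D}' | \mathcal{D}, S', \theta)$
are identical to $p(\theta)$, $p(\mathcal{D} | \theta)$ and
$p(\mathcal{D}' | \mathcal{D}, S', \theta)$ respectively, but differ in
$p'(S | \mathcal{D}, \theta)$:

$$ p'(S | \mathcal{D}, \theta) =
   \begin{cases} 1 & \text{if } S = S', \\ 0 & \text{else.} \end{cases} $$

That is, $p'$ is identical to $p$ but it assumes that the value of $S$
is fixed to $S'$ given $\mathcal{D}$ and $\theta$. For $p'$, the
simulated data $S'$ is non-informative:

$$ -\log_2 p'(S' | \mathcal{D}, \theta) = 0. $$

If one computes the posterior pdf
$p'(\theta | \mathcal{D}, S', \mathcal{D}')$, one obtains the result of
Equation (7):

$$
\frac{p'(\mathcal{D}' | \mathcal{D}, S', \theta) \, p'(S' | \mathcal{D}, \theta) \, p'(\mathcal{D} | \theta) \, p'(\theta)}
     {\int p'(\mathcal{D}' | \mathcal{D}, S', \theta') \, p'(S' | \mathcal{D}, \theta') \, p'(\mathcal{D} | \theta') \, p'(\theta') \, d\theta'}
=
\frac{p(\mathcal{D}' | \mathcal{D}, S', \theta) \, p(\mathcal{D} | \theta) \, p(\theta)}
     {\int p(\mathcal{D}' | \mathcal{D}, S', \theta') \, p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}.
$$

Thus, in order to explain Equation (7) as a posterior pdf given the
data sets $\mathcal{D}$, $\mathcal{D}'$ and the simulated data $S'$,
one has to intervene $p$ in order to account for the fact that $S'$ is
non-informative given $\mathcal{D}$ and $\theta$.

In statistics, there is a rich literature on causal intervention. In
particular, we will use the formalism developed by Pearl [2000],
because it suits the needs to formalize interactions in systems and has
a convenient notation (compare Figures 1a & 1b). Given a causal
model¹, variables that are intervened are denoted by a hat as in
$\hat{S}$. In the previous example, the causal model of the joint pdf
$p(\theta, \mathcal{D}, S, \mathcal{D}')$ is given by the set of
conditional pdfs

$$ C_p = \{ p(\theta), \; p(\mathcal{D} | \theta), \; p(S | \mathcal{D}, \theta), \;
            p(\mathcal{D}' | \mathcal{D}, S, \theta) \}. $$

If $\mathcal{D}$ and $\mathcal{D}'$ are observed from the source and
$S$ is intervened to take on the value $S'$, then the posterior pdf
over the parameters $\theta$ is given by
$p(\theta | \mathcal{D}, \hat{S}', \mathcal{D}')$, which is just

$$
\frac{p(\mathcal{D}' | \mathcal{D}, \hat{S}', \theta) \, p(\hat{S}' | \mathcal{D}, \theta) \, p(\mathcal{D} | \theta) \, p(\theta)}
     {\int p(\mathcal{D}' | \mathcal{D}, \hat{S}', \theta') \, p(\hat{S}' | \mathcal{D}, \theta') \, p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}
=
\frac{p(\mathcal{D}' | \mathcal{D}, S', \theta) \, p(\mathcal{D} | \theta) \, p(\theta)}
     {\int p(\mathcal{D}' | \mathcal{D}, S', \theta') \, p(\mathcal{D} | \theta') \, p(\theta') \, d\theta'}
$$

because
$p(\mathcal{D}' | \mathcal{D}, \hat{S}', \theta) = p(\mathcal{D}' | \mathcal{D}, S', \theta)$,
which corresponds to applying rule 2 in Pearl's intervention calculus,
and because
$p(\hat{S}' | \mathcal{D}, \theta') = p'(S' | \mathcal{D}, \theta') = 1$.

¹ For our needs, it is enough to think about a causal model as a
complete factorization of a probability distribution into conditional
probability distributions representing the causal structure.

Figure 1: (a-b) Two causal networks, and the result of conditioning on
D = D' and intervening on S = S'. Unlike the condition, the
intervention is set endogenously, thus removing the link to the parent
$\theta$. (c-d) A causal network representation of an I/O system with
four variables $a_1 o_1 a_2 o_2$ and latent variable $m$. (c) The
initial, un-intervened network. (d) The intervened network after
experiencing $\hat{a}_1 o_1 \hat{a}_2 o_2$.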
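The difference between conditioning on $S'$ and intervening on it can be checked numerically with a small Python sketch. It assumes a parameter with only two possible values instead of a continuous one, and all likelihood values are hypothetical; `posterior` is an assumed helper name.

```python
def posterior(prior, factors):
    """Normalize prior[k] times the product of factors[i][k] over hypotheses k."""
    unnormalized = []
    for k, p in enumerate(prior):
        for factor in factors:
            p *= factor[k]
        unnormalized.append(p)
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Hypothetical two-valued parameter theta in {theta_0, theta_1}.
prior = [0.5, 0.5]          # p(theta)
lik_D = [0.2, 0.8]          # p(D | theta)
lik_S = [0.3, 0.7]          # p(S' | D, theta), likelihood of the simulated data
lik_D2 = [0.5, 0.5]         # p(D' | D, S', theta), here uninformative

# Conditioning: the simulated data S' is treated as evidence.
conditioned = posterior(prior, [lik_D, lik_S, lik_D2])

# Intervening: the factor p(S' | D, theta) is replaced by 1, as in Equation (7).
intervened = posterior(prior, [lik_D, lik_D2])

# Reference: the posterior given the original data D alone.
given_D_only = posterior(prior, [lik_D])
```

With these numbers the intervened posterior coincides with the posterior given $\mathcal{D}$ alone, because $\mathcal{D}'$ was chosen to be uninformative, whereas the conditioned posterior is shifted further by the model's own sample.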

Adaptive Systems: Causal Construction 

Following the discussion in the previous section, we want to construct
an adaptive agent $\mathbf{P}$ by minimizing the KL-divergence to the
$P_m$, but this time treating actions as interventions. Based on the
definition of the conditional probabilities in Equation (5), we now
construct the KL-divergence criterion to characterize $\mathbf{P}$
using intervention calculus. Importantly, interventions index a set of
intervened probability distributions derived from an initial
probability distribution. Hence, the set of fixed intervention
sequences of the form $\hat{a}_1 \hat{a}_2 \ldots$ indexes probability
distributions over observation sequences $o_1 o_2 \ldots$. Because of
this, we are going to construct a set of criteria indexed by the
intervention sequences, but we will see that they all have the same
solution. Define the history-dependent intervened KL-divergences over
the action $a_t$ and observation $o_t$ as

$$
\begin{aligned}
C_m^{a_t}(\hat{a}o_{<t}) &\equiv \sum_{a_t} P(a_t | m, \hat{a}o_{<t})
  \log_2 \frac{P(a_t | m, \hat{a}o_{<t})}{\Pr(a_t | ao_{<t})} \\
C_m^{o_t}(\hat{a}o_{<t} \hat{a}_t) &\equiv \sum_{o_t} P(o_t | m, \hat{a}o_{<t} \hat{a}_t)
  \log_2 \frac{P(o_t | m, \hat{a}o_{<t} \hat{a}_t)}{\Pr(o_t | ao_{<t} a_t)},
\end{aligned}
$$

where $\Pr$ is a given arbitrary agent. Note that past actions are
treated as interventions. Then, define the average KL-divergences over
$a_t$ and $o_t$ as

$$
\begin{aligned}
C_m^{a_t} &\equiv \sum_{ao_{<t}} P(\hat{a}o_{<t} | m) \, C_m^{a_t}(\hat{a}o_{<t}) \\
C_m^{o_t} &\equiv \sum_{ao_{<t} a_t} P(\hat{a}o_{<t} \hat{a}_t | m) \, C_m^{o_t}(\hat{a}o_{<t} \hat{a}_t).
\end{aligned}
$$

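Finally, we define the total expected KL-divergence of $\Pr$ to $P_m$
as

$$ C \equiv \limsup_{t \to \infty} \sum_m P(m) \sum_{\tau=1}^{t}
   \left( C_m^{a_\tau} + C_m^{o_\tau} \right). \tag{8} $$

We construct the agent $\mathbf{P}$ as the system that minimizes
$C = C(\Pr)$:

$$ \mathbf{P} \equiv \arg\min_{\Pr} C(\Pr). \tag{9} $$

The solution to Equation (9) is the system $\mathbf{P}$ defined by the
set of equations

$$
\begin{aligned}
\mathbf{P}(a_t | ao_{<t}) &= P(a_t | \hat{a}o_{<t})
  = \sum_m P(a_t | m, \hat{a}o_{<t}) \, v_m(ao_{<t}) \\
\mathbf{P}(o_t | ao_{<t} a_t) &= P(o_t | \hat{a}o_{<t} \hat{a}_t)
  = \sum_m P(o_t | m, \hat{a}o_{<t} \hat{a}_t) \, v_m(ao_{<t} a_t)
\end{aligned}
\tag{10}
$$

valid for all $ao_{\le t} \in \mathcal{Z}^*$, where the mixture weights
are

$$
v_m(ao_{<t} a_t) = v_m(ao_{<t})
\equiv \frac{P(m) \, P(\hat{a}o_{<t} | m)}{\sum_{m'} P(m') \, P(\hat{a}o_{<t} | m')}
= \frac{P(m) \prod_{\tau=1}^{t-1} P(o_\tau | m, \hat{a}o_{<\tau} \hat{a}_\tau)}
       {\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(o_\tau | m', \hat{a}o_{<\tau} \hat{a}_\tau)}.
\tag{11}
$$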

The proof follows the same line of argument as the solution to
Equation (2) with the crucial difference that actions are treated as
interventions. Consider without loss of generality the summand
$\sum_m P(m) C_m^{a_t}$ in Equation (8). Note that the KL-divergence
can be written as a difference of two logarithms, where only one term
depends on the $\Pr$ that we want to vary. Therefore, we can integrate
out the other term and write it as a constant $c$. Then we get

$$ c - \sum_m P(m) \sum_{ao_{<t}} P(\hat{a}o_{<t} | m)
   \sum_{a_t} P(a_t | m, \hat{a}o_{<t}) \ln \Pr(a_t | ao_{<t}). $$

Substituting $P(\hat{a}o_{<t} | m)$ by
$P(m | \hat{a}o_{<t}) P(\hat{a}o_{<t}) / P(m)$ and identifying
$\mathbf{P}$ characterized by Equations (10) and (11), we obtain

$$ c - \sum_{ao_{<t}} P(\hat{a}o_{<t})
   \sum_{a_t} P(a_t | \hat{a}o_{<t}) \ln \Pr(a_t | ao_{<t}). $$

The inner sum has the form $-\sum_x p(x) \ln q(x)$, i.e. the
cross-entropy between $q(x)$ and $p(x)$, which is minimized when
$q(x) = p(x)$ for all $x$. By choosing this optimum one obtains
$\Pr(a_t | ao_{<t}) = P(a_t | \hat{a}o_{<t})$ for all $a_t$. Note that
the solution to this variational problem is independent of the
weighting $P(\hat{a}o_{<t})$. Since the same argument applies to any
summand $\sum_m P(m) C_m^{a_\tau}$ and $\sum_m P(m) C_m^{o_\tau}$ in
Equation (8), their variational problems are mutually independent.
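The claim that the cross-entropy is minimized at $q = p$ follows from a standard decomposition (Gibbs' inequality), written out here for completeness with $H(p)$ denoting the entropy of $p$ and $D_{\mathrm{KL}}$ the KL-divergence:

```latex
\begin{align*}
-\sum_x p(x) \ln q(x)
  &= -\sum_x p(x) \ln p(x) + \sum_x p(x) \ln \frac{p(x)}{q(x)} \\
  &= H(p) + D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; H(p),
\end{align*}
```

with equality if and only if $q(x) = p(x)$ for all $x$ with $p(x) > 0$, since the KL-divergence is non-negative and vanishes only in that case.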

The behavior of $\mathbf{P}$ differs in an important aspect from $P$.
At any given time $t$, $\mathbf{P}$ maintains a mixture over systems
$P_m$. The weighting over these systems is given by the mixture
coefficients $v_m$. In contrast to $P$, $\mathbf{P}$ updates the
weights $v_m$ only whenever a new observation $o_t$ is produced by the
environment. The update follows Bayes' rule but treats past actions as
interventions, i.e. it drops the evidence they provide. In addition,
$\mathbf{P}$ issues an action $a_t$ suggested by a system $m$ drawn
randomly according to the weights $v_m$ (see Figures 1c & 1d).

If we use the following equalities connecting the weights and the
intervened posterior distributions

$$ v_m(ao_{<t}) = P(m | \hat{a}o_{<t}) = P(m | \hat{a}o_{<t} \hat{a}_t) = v_m(ao_{<t} a_t) $$

and substitute interventions by observations in the conditionals

$$
\begin{aligned}
P(a_t | m, \hat{a}o_{<t}) &= P(a_t | m, ao_{<t}) \\
P(o_t | m, \hat{a}o_{<t} \hat{a}_t) &= P(o_t | m, ao_{<t} a_t),
\end{aligned}
$$

which corresponds to rule 2 of Pearl's intervention calculus, we can
rewrite Equations (10) and (11) as

$$ \mathbf{P}(a_t | ao_{<t}) = P(a_t | \hat{a}o_{<t})
   = \sum_m P(a_t | m, ao_{<t}) \, P(m | \hat{a}o_{<t}) \tag{12} $$

$$ \mathbf{P}(o_t | ao_{<t} a_t) = P(o_t | \hat{a}o_{<t} \hat{a}_t)
   = \sum_m P(o_t | m, ao_{<t} a_t) \, P(m | \hat{a}o_{<t}) \tag{13} $$

where the intervened posterior probabilities are

$$ P(m | \hat{a}o_{<t}) =
   \frac{P(m) \prod_{\tau=1}^{t-1} P(o_\tau | m, ao_{<\tau} a_\tau)}
        {\sum_{m'} P(m') \prod_{\tau=1}^{t-1} P(o_\tau | m', ao_{<\tau} a_\tau)}.
   \tag{14} $$

Equations (12), (13) and (14) are important because they describe the
behavior of $\mathbf{P}$ only in terms of known probabilities, i.e.
probabilities that are computable from the causal model associated to
$P$ given by

$$ C_P = \{ P(m), \; P(a_t | m, ao_{<t}), \; P(o_t | m, ao_{<t} a_t) : t \ge 1 \}. $$

Importantly, Equation (12) describes a stochastic method to produce
desirable actions that differs fundamentally from an agent that is
constructed by choosing an optimal policy with respect to a given
utility criterion. We call this action selection rule the Bayesian
control rule.
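Equations (12) and (14) translate into a short sampling loop. The following Python sketch is one possible reading under stated assumptions: the models are passed in as functions returning probability dictionaries, and the names `bayesian_control_rule`, `sample`, `action_models`, `obs_models` and `environment` are hypothetical.

```python
import random

def sample(dist, rng):
    """Draw a key of `dist`, a dict mapping values to probabilities."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def bayesian_control_rule(prior, action_models, obs_models, environment, steps, rng):
    """Run the Bayesian control rule for `steps` interactions.

    action_models[m](history)  -> dict: P(a_t | m, ao_<t)
    obs_models[m](history, a)  -> dict: P(o_t | m, ao_<t a_t)
    environment(history, a)    -> dict: Q(o_t | ao_<t a_t)
    Returns the final intervened posterior over m and the history.
    """
    posterior = list(prior)
    history = []
    for _ in range(steps):
        # Equation (12): draw m from the intervened posterior, then an
        # action from P(a_t | m, ao_<t). The action does not change the weights.
        m = rng.choices(range(len(posterior)), weights=posterior)[0]
        a = sample(action_models[m](history), rng)
        o = sample(environment(history, a), rng)
        # Equation (14): only the observation likelihood enters the update.
        unnormalized = [w * obs_models[k](history, a)[o]
                        for k, w in enumerate(posterior)]
        z = sum(unnormalized)
        posterior = [u / z for u in unnormalized]
        history.append((a, o))
    return posterior, history

# Hypothetical usage with two memoryless binary systems.
action_models = [lambda h: {0: 0.9, 1: 0.1}, lambda h: {0: 0.1, 1: 0.9}]
obs_models = [lambda h, a: {0: 0.6, 1: 0.4}, lambda h, a: {0: 0.4, 1: 0.6}]
final_posterior, run_history = bayesian_control_rule(
    [0.5, 0.5], action_models, obs_models,
    environment=obs_models[0], steps=50, rng=random.Random(0))
```

Sampling $m$ first and then an action from $P(a_t | m, ao_{<t})$ is equivalent to sampling directly from the mixture in Equation (12), and it makes the "action suggested by a randomly drawn system" reading explicit.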



Experimental Results 

Here we design a very simple toy experiment to illustrate the behavior
of an agent $P$ based on a Bayesian mixture compared to an agent
$\mathbf{P}$ based on the Bayesian control rule.

Let $Q_0$, $Q_1$, $P_0$ and $P_1$ be four agents with binary I/O sets
$\mathcal{A} = \mathcal{O} = \{0, 1\}$ defined as follows. $P_1$ is
such that $P_1(a_t | ao_{<t}) = P_1(a_t)$ and
$P_1(o_t | ao_{<t} a_t) = P_1(o_t)$ for all
$ao_{\le t} \in \mathcal{Z}^*$, where

$$
P_1(a_t) = \begin{cases} 0.1 & \text{if } a_t = 0 \\ 0.9 & \text{if } a_t = 1, \end{cases}
\qquad
P_1(o_t) = \begin{cases} 0.4 & \text{if } o_t = 0 \\ 0.6 & \text{if } o_t = 1. \end{cases}
$$

Let $P_0$ be such that

$$
\begin{aligned}
P_0(a_t | ao_{<t}) &= 1 - P_1(a_t | ao_{<t}) \\
P_0(o_t | ao_{<t} a_t) &= 1 - P_1(o_t | ao_{<t} a_t)
\end{aligned}
$$

for all $ao_{\le t} \in \mathcal{Z}^*$. Thus, $P_0$ and $P_1$ are
agents that are biased towards observing and acting 0's and 1's
respectively. Furthermore, $Q_0 = P_0$ and $Q_1 = P_1$. Assume a
uniform distribution over $\mathcal{Q} = \{Q_0, Q_1\}$, i.e.
$P(m = 0) = P(m = 1) = \frac{1}{2}$.

Assume $Q_0 \in \mathcal{Q}$ is drawn. In this case, one wants the
agents $P$ and $\mathbf{P}$ to minimize the deviation from $P_0$.
Consider the following instantaneous measure

$$ d(t) \equiv \sum_{a'_t} P_0(a'_t) \log_2 \frac{P_0(a'_t)}{\Pr(a'_t | ao_{<t})}
   + \sum_{o'_t} P_0(o'_t) \log_2 \frac{P_0(o'_t)}{\Pr(o'_t | ao_{<t} a_t)}, $$

where $a_1 o_1 a_2 o_2 \ldots$ is a realization of the interaction
system $(\Pr, Q_0)$. $d(t)$ measures how much $\Pr$'s action and
observation probabilities deviate from $P_0$ at time $t$.

Recall that both $P$ and $\mathbf{P}$ maintain a mixture over $P_0$ and
$P_1$. The instantaneous I/O probabilities of such a system can always
be written as

$$
\begin{aligned}
& w P_0(a_t) + (1 - w) P_1(a_t) \\
& w P_0(o_t) + (1 - w) P_1(o_t),
\end{aligned}
$$

where $w \in [0, 1]$. Thus, it is easy to see that the instantaneous
I/O deviation takes on the minimum value when $w = 1$ and the maximum
value when $w = 0$: in the case $w = 1$, $d(t) = 0$ bits; in the case
$w = 0$, $d(t) \approx 2.653$ bits.
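The two extreme values of $d(t)$ can be checked directly from the definitions of $P_0$ and $P_1$. In the following Python sketch the helper names `kl_bits` and `deviation` are assumptions made for illustration.

```python
from math import log2

def kl_bits(p, q):
    """KL-divergence from p to q in bits, for dicts over the same outcomes."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

# Action and observation probabilities of P_0 and P_1 from the text.
P0_a, P0_o = {0: 0.9, 1: 0.1}, {0: 0.6, 1: 0.4}
P1_a, P1_o = {0: 0.1, 1: 0.9}, {0: 0.4, 1: 0.6}

def deviation(w):
    """Instantaneous deviation d(t) of the mixture with weight w on P_0."""
    mix_a = {x: w * P0_a[x] + (1 - w) * P1_a[x] for x in P0_a}
    mix_o = {x: w * P0_o[x] + (1 - w) * P1_o[x] for x in P0_o}
    return kl_bits(P0_a, mix_a) + kl_bits(P0_o, mix_o)

d_min = deviation(1.0)   # mixture equals P_0
d_max = deviation(0.0)   # mixture equals P_1
```

The maximum decomposes into an action term $0.8 \log_2 9 \approx 2.536$ bits and an observation term $0.2 \log_2 1.5 \approx 0.117$ bits, which together give approximately 2.653 bits.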

We have simulated realizations of the instantaneous I/O deviation using
the agents $P$ and $\mathbf{P}$. The results are summarized in
Figure 2. For $P$, $d(t)$ happens to be non-ergodic: it either
converges to $d(t) \to 0$ or to $d(t) \to 2.653$, implying that either
$P \to P_0$ or $P \to P_1$ respectively. In contrast, $d(t) \to 0$
always for $\mathbf{P}$, implying that $\mathbf{P} \to P_0$.

Analogous results are obtained when $Q_1 \in \mathcal{Q}$ is drawn
instead: for $P$, $d(t)$ converges either to 0 or to $\approx 2.653$,
whereas for $\mathbf{P}$, $d(t) \to 2.653$ always, implying that
$\mathbf{P} \to P_1$. Hence, $\mathbf{P}$ shows the correct adaptive
behavior while $P$ does not.
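A reader who wants to explore this comparison can use a small simulation along the following lines. This is a minimal sketch, not the code used for the reported results: the function names, the number of steps and the random seed are assumptions, and the only difference between the two agents is whether the agent's own action enters the weight update.

```python
import random

P_A = [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]   # P_m(a_t) for m = 0, 1
P_O = [{0: 0.6, 1: 0.4}, {0: 0.4, 1: 0.6}]   # P_m(o_t) = Q_m(o_t)

def draw(dist, rng):
    """Draw a key of `dist`, a dict mapping values to probabilities."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def update(weights, likelihoods):
    """Bayes' rule: posterior proportional to weight times likelihood."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def run(true_m, steps, condition_on_actions, rng):
    """Simulate one realization and return the final weight on m = 0."""
    w = [0.5, 0.5]
    for _ in range(steps):
        # Both agents act by sampling a system from the current weights.
        m = rng.choices([0, 1], weights=w)[0]
        a = draw(P_A[m], rng)
        if condition_on_actions:
            # Naive mixture: the agent's own action is treated as evidence.
            w = update(w, [P_A[k][a] for k in (0, 1)])
        # Both agents condition on the observation from the environment.
        o = draw(P_O[true_m], rng)
        w = update(w, [P_O[k][o] for k in (0, 1)])
    return w[0]

rng = random.Random(1)
weight_control_rule = run(true_m=0, steps=2000, condition_on_actions=False, rng=rng)
weight_naive = run(true_m=0, steps=2000, condition_on_actions=True, rng=rng)
```

Repeating `run` over many seeds gives the fraction of naive realizations that lock onto the wrong system; a single run of the naive agent shows only one of its two possible outcomes.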




Figure 2: 10 realizations of the instantaneous deviation $d(t)$ for the
agents $P$ (left panel, $t$ up to 25) and $\mathbf{P}$ (right panel,
$t$ up to 250). The shaded region represents the standard deviation
barriers computed over 1000 realizations. Since $d(t)$ is non-ergodic
for $P$, we have separated the realizations converging to 0 from the
realizations converging to $\approx 2.653$ to compute the barriers.
Note that the time scales differ by one order of magnitude.



Conclusions 

We propose a Bayesian rule for adaptive control. The key feature of
this rule is the special treatment of actions based on causal calculus
and the decomposition of agents into a Bayesian mixture of I/O
distributions. The question of how to integrate information generated
by an agent's probabilistic model into the agent's information state
lies at the very heart of adaptive agent design. We show that the naive
application of Bayes' rule to I/O distributions leads to
inconsistencies, because outputs don't provide the same type of
information as genuine observations. Crucially, these inconsistencies
vanish if intervention calculus is applied [Pearl, 2000].

Some of the presented key ideas are not unique to the Bayesian control
rule. The idea of representing agents and environments as I/O streams
has been proposed by a number of other approaches, such as predictive
state representation (PSR) [Littman et al., 2002] and the universal AI
approach by Hutter [2004]. The idea of breaking down a control problem
into a superposition of controllers has been previously evoked in the
context of "mixture of experts" models like the MOSAIC architecture
[Haruno et al., 2001]. Other stochastic action selection approaches are
found in exploration strategies for (PO)MDPs [Wyatt, 1997], learning
automata [Narendra and Thathachar, 1974] and in probability matching
[R.O. Duda, 2001] amongst others. The usage of compression principles
to select actions has been proposed by AI researchers, for example
Schmidhuber [2009]. The main contribution of this paper is the
derivation of a stochastic action selection and inference rule by
minimizing KL-divergences of intervened I/O distributions.

An important potential application of the Bayesian control rule would
naturally be the realm of adaptive control problems. Since it takes on
a similar form to Bayes' rule, the adaptive control problem could then
be translated into an on-line inference problem where actions are
sampled stochastically from a posterior distribution. It is important
to note, however, that the problem statement as formulated here and the
usual Bayes-optimal approach in adaptive control are not the same. In
the future the relationship between these two problem statements
deserves further investigation.